EXECUTIVE SUMMARY


Classification applications require the extraction of class-discriminative information.
However, this step often leads to high-dimensional feature spaces, which require large datasets
to create viable classification schemes. This study presents follow-on work to that of Duzenli
[DUZ98] and San Pedro [SAN00], and considers two discriminant-based feature dimension
reduction schemes for classification applications. The two feature reduction schemes considered
are the Mahalanobis-based dimension reduction (MBDR) scheme recently proposed by Brunzell
[BRE99], and the kernel-based generalized discriminant analysis (GDA) approach proposed by
Baudat & Anouar [BAN00]. The GDA is part of a new breed of kernel-based algorithms that are
currently being considered by the research community to develop new learning techniques, as
they can be used to derive nonlinear generalizations of currently known algorithms. Finally, the
classical PCA and the MSNN proposed earlier in [DUZ98] are included in this study for
comparison purposes.

The four feature dimension reduction schemes considered were implemented in
MATLAB and evaluated by applying the transformed features to a basic minimum distance
classifier. Performances are evaluated by applying these schemes to three datasets commonly
used in statistics for benchmarking purposes. Results show that the GDA gives the best overall
performance on the datasets considered. Results also show that there is no consistent second-best
feature reduction scheme among the MSNN, the MBDR, and the PCA, as the performances of
these three schemes are data dependent.
I. INTRODUCTION


A.  BACKGROUND 

The work presented in this report is part of a larger scale study conducted during 1999
and 2000 where we investigated various feature extraction and dimension reduction schemes,
and their application to the classification of digital modulation types. The overall study was
divided into three separate phases.

•  The  first  phase  of  the  overall  study  investigated  extensions  to  the  MSNN  approach  originally 
derived  in  1998  [DUZ98]  to  include  variance  information  in  the  optimization  criterion. 
Results  obtained  with  synthetic  data  and  basic  communication  schemes  were  presented  in 
San  Pedro  [SAN00].  Results  showed  no  significant  improvements  over  the  original  MSNN 
for  the  data  investigated. 

•  The  second  phase  of  the  overall  study,  which  this  report  specifically  focuses  on,  investigated 
two  new  feature  dimension  reduction  schemes  and  their  resulting  performances  on 
benchmarking  datasets. 

• The third phase of the overall study investigated the application of a selected few higher-order
statistic parameters to the classification of digital modulation schemes of types [2,4,8]-PSK,
[2,4,8]-FSK, and [16,64,256]-QAM in low SNR levels and multipath propagation
channel environments. A hierarchical tree-based classifier was proposed and its
performance studied over various types of propagation channels [FAH01, HAT01].

B.  OBJECTIVES 

Extracting relevant features that allow for class discrimination is the first critical step in
classification applications. However, this step often leads to high-dimensional feature spaces,
which require large (and potentially unavailable) datasets to create viable classification
schemes. In addition, some of the features may carry little useful information or be correlated
with others, resulting in redundancies in the feature space. As a result, there is a strong incentive
to reduce the feature space dimension. Two classical types of approaches to reduce feature
dimension exist: Principal Component Analysis (PCA)-based and discriminant-based approaches.
The main difference between the two types lies in the criterion selected: PCA-based schemes
seek a projection direction which best represents the data in a norm sense, while discriminant-based
schemes seek a projection that best separates the class data [DHS01]. We proposed in
earlier work a simple discriminant-based feature dimension reduction scheme called the Mean
Separator Neural Network (MSNN). The MSNN belongs to the class of projection pursuit
algorithms, where the goal is to find a projection direction that emphasizes class discrimination
[BIS95]. Results showed the MSNN scheme to perform very well on the underwater
data considered during this earlier study [DUZ98, DFA98, FDU98]. The MSNN approach can
be viewed as a one-layer neural network (NN) implementation where the goal is to find the
projection index, i.e., the weight vector, which maximizes the absolute difference between the
means of the projected class data. As a result, it suffers from the same drawback as numerous
other NN implementations: the iterative procedure is not guaranteed to converge to the
global minimum due to the nonlinear activation function present in the optimization criterion.
While the “local minima” issue was shown not to be a problem for the data investigated in our
earlier study, it motivated this follow-on work where we investigate two alternate discriminant-based
dimension reduction schemes which do not exhibit such behavior.

The two feature reduction schemes considered are the Mahalanobis-based dimension
reduction (MBDR) recently proposed by Brunzell [BRU97], and the kernel-based generalized
discriminant analysis (GDA) proposed by Baudat & Anouar [BAN00]. In addition, we
benchmark these two schemes against the classical PCA approach and the MSNN scheme.

Chapter II briefly reviews the PCA approach, as applied to classification applications.
Chapters III and IV present the Mahalanobis-based dimension reduction approach and the
kernel-based generalized discriminant analysis scheme, respectively. The basic MSNN scheme is
described in Chapter V. The four feature dimension reduction schemes considered in this study are
implemented in MATLAB and evaluated by applying the transformed features to a basic
minimum distance classifier. Three classification datasets commonly used in statistics for
benchmarking purposes are selected to compare the schemes, and results are discussed in Chapter VI.
Finally, Chapter VII presents conclusions.
II. PRINCIPAL COMPONENT ANALYSIS


A.  INTRODUCTION 

Principal Component Analysis (PCA) is one possible approach to reduce the
dimensionality of the class features under consideration. The method projects high-dimensional
data vectors onto a lower dimensional space by using a projection which best represents the data
in a mean square sense, i.e., leads to projected data vectors which preserve most of the energy
contained in the original data [DHS01, BIS95]. This linear dimension reduction scheme uses the
Karhunen-Loeve transformation, which represents a given data vector as a linear combination of
the eigenvectors obtained from the data covariance matrix. As a result, lower dimensional data
vectors may be obtained by projecting the high-dimensional data vectors onto a user-specified
number of eigenvectors associated with the largest eigenvalues of the data covariance matrix.
PCA is widely used in engineering applications, for example in compression, as it
preserves most of the original overall data information, and in statistics, where it can be applied
to decorrelate data prior to processing. However, the PCA projection criterion is not
necessarily well designed for classification applications where the goal is to best discriminate
between classes, not to preserve most of the energy in a lower dimensional class feature space.
Nevertheless, it is a classical tool applied extensively, and we will use it in our comparison of the
various dimension reduction schemes considered in this study.

B.  DESCRIPTION 

The PCA maps an ensemble of P N-dimensional vectors $X = [x_1, \ldots, x_P]$ onto an ensemble
of P D-dimensional vectors $Y = [y_1, \ldots, y_P]$, where D < N, using a linear transformation which can be
represented by the rectangular matrix A so that:

$$y_i = A^H x_i, \quad i = 1, \ldots, P, \tag{2.1}$$

where A has orthogonal column vectors. For PCA, the matrix A is selected as the N×D matrix
containing the D eigenvectors associated with the largest eigenvalues of the data covariance
matrix $XX^H$. With such a choice of transformation matrix A, the transformed data vectors $y_i$ have
uncorrelated components, as:

$$YY^H = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_D \end{bmatrix},$$

where $\lambda_i$, i = 1, ..., D, are the D largest eigenvalues of the data covariance matrix $XX^H$. The concept of PCA
is illustrated next by considering three classes of two-dimensional data, as shown in Figures II-1
and II-2, where the data dimension is to be reduced to one. The transformation matrix A is of
dimension 2×1, and the projected data sets lie on a line. Figures II-1 & II-2 show that the PCA
projection direction preserves most of the signal energy but also generates projected data with a
significant amount of overlap between two of the projected classes.
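
As a rough illustration of the projection in Eq. (2.1) for D = 1, the following Python sketch estimates the dominant eigenvector of the sample covariance matrix by power iteration and projects the data onto it. The function names, the mean-removal step, and the use of power iteration in place of a full eigendecomposition are assumptions made for this sketch; it is not the MATLAB implementation used in the study.

```python
import math

def covariance(data):
    """Mean vector and sample covariance matrix of equal-length feature vectors."""
    n, p = len(data[0]), len(data)
    mean = [sum(x[i] for x in data) / p for i in range(n)]
    centered = [[x[i] - mean[i] for i in range(n)] for x in data]
    cov = [[sum(c[i] * c[j] for c in centered) / p for j in range(n)]
           for i in range(n)]
    return mean, cov

def dominant_eigenvector(S, iters=500):
    """Power iteration; assumes the start vector is not orthogonal to the answer."""
    n = len(S)
    v = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def pca_project_1d(data):
    """Return the unit vector a and the projections y_i = a^T (x_i - mean)."""
    mean, cov = covariance(data)
    a = dominant_eigenvector(cov)
    y = [sum(ai * (xi - mi) for ai, xi, mi in zip(a, x, mean)) for x in data]
    return a, y

# Hypothetical two-dimensional data lying close to the line x2 = x1.
points = [(2.0, 1.9), (-2.0, -2.1), (1.0, 1.2), (-1.0, -1.0), (0.5, 0.4), (-0.5, -0.4)]
a, y = pca_project_1d(points)
```

For these points the recovered direction is close to the diagonal of the plane, which is the direction of largest variance; nothing in the criterion looks at class labels.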
Figure II-1. Two-dimensional PCA projection, example 1. (Panels: projected data; projected data histogram.)

Figure II-2. Two-dimensional PCA projection, example 2. (Panel: two-dimensional class features and one-dimensional PCA projections.)


III.  MAHALANOBIS-BASED  DIMENSION  REDUCTION 


A.  INTRODUCTION 

As shown earlier, PCA may not be well suited to reduce feature dimension in
classification applications where the main goal is to preserve class discrimination. Fisher’s
linear discriminant analysis (LDA) is better suited, as it seeks a projection direction
which best discriminates between the classes considered [FUK90, DHS01]. Fisher’s
discriminant was initially derived for the two-class problem and extended later to the
more-than-two-class problem. The Fisher projection index for the two-class problem is derived
as the direction that maximizes the following ratio:


$$J(w) = \frac{w^T S_B w}{w^T S_W w}. \tag{3.1}$$

The matrices $S_B$ and $S_W$ respectively represent the between-class and within-class scatter
matrices defined as:

$$S_B = (m_1 - m_2)(m_1 - m_2)^T, \qquad S_W = \sum_{i=1}^{2} \Sigma_i,$$

where $m_1$ and $m_2$ respectively represent the class-specific means for classes $C_1$ and $C_2$, and
$\Sigma_i$, i = 1, 2, are the class-specific data covariance matrices for the two classes under
consideration. As a result, the projection criterion aims at maximizing a ratio of the separation
between projected class data and the projected class-specific data variance information, thereby
preserving discrimination information between the two classes considered. It can be shown that
the criterion function J(w) may be maximized by finding the projection vector w which satisfies
the following generalized eigenvalue problem [DHS01, FUK90]:

$$S_B w = \lambda S_W w,$$

which leads to

$$w = S_W^{-1}(m_1 - m_2). \tag{3.2}$$

The  Fisher  Linear  Discriminant  can  be  extended  to  a  higher  number  of  classes  (called  the 
Multiple  Discriminant  Analysis  (MDA)  approach)  by  generalizing  between-class  and  within- 
class  scatter  matrices  to  the  more  than  two  classes  problem  [DHS01,  FUK90]. 
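
The two-class solution of Eq. (3.2) can be sketched in a few lines of Python for two-dimensional features. The helper names and the toy data are hypothetical, and the closed-form 2×2 inverse is a simplification adopted here to keep the sketch free of external libraries.

```python
def mean_and_cov(data):
    """Class mean and 2x2 covariance matrix of a list of 2-D points."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / n
    syy = sum((p[1] - my) ** 2 for p in data) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))

def fisher_direction(class1, class2):
    """w = S_W^{-1} (m1 - m2) with S_W = Sigma_1 + Sigma_2, as in Eq. (3.2)."""
    m1, c1 = mean_and_cov(class1)
    m2, c2 = mean_and_cov(class2)
    a = c1[0][0] + c2[0][0]
    b = c1[0][1] + c2[0][1]
    d = c1[1][1] + c2[1][1]
    det = a * d - b * b  # must be non-zero for S_W to be invertible
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    # Closed-form inverse of the symmetric 2x2 matrix [[a, b], [b, d]].
    return ((d * dx - b * dy) / det, (-b * dx + a * dy) / det)

# Hypothetical classes: elongated along the first axis, separated along the second.
class1 = [(0.0, 0.0), (1.0, 0.1), (2.0, -0.1), (3.0, 0.0)]
class2 = [(0.0, 1.0), (1.0, 1.1), (2.0, 0.9), (3.0, 1.0)]
w = fisher_direction(class1, class2)
proj1 = [w[0] * x + w[1] * y for x, y in class1]
proj2 = [w[0] * x + w[1] * y for x, y in class2]
```

On this toy data most of the variance lies along the first axis, so a PCA projection would mix the two classes, whereas the Fisher direction points almost entirely along the second axis and keeps the projected classes apart.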

The feature dimension reduction proposed by Brunzell [BRU97, BRE99] follows the
same basic concept as the MDA, namely to find a linear projection that preserves
the separation between classes. However, Brunzell proposes to accomplish the task by defining
a pairwise Mahalanobis class distance measure and stacking all possible pairwise Mahalanobis-based
distances into a transformation matrix, hence the name Mahalanobis-based Dimension
Reduction (MBDR) approach. The MBDR approach and the Fisher linear discriminant are
identical for the two-class problem. The difference between the two schemes lies in the
generalization to the more-than-two-class problem, where the MBDR scheme preserves the
pairwise approach while the MDA does not. Brunzell showed that his proposed transformation
preserves the separation between classes. Performance evaluations of the MBDR feature
dimension reduction scheme were conducted by applying the proposed scheme to seven datasets
widely used in classification benchmarking, where the data dimensions were reduced to two and
classification performances were computed with a basic quadratic classifier. Brunzell
showed that classification performances obtained using the MBDR scheme are as good as or better
than those of the basic Fisher LDA approach and its variants on the benchmarking datasets
considered. As a result, we will consider the MBDR approach and not the basic LDA
implementation in our classifier performance comparisons.
B. DESCRIPTION


The Mahalanobis-Based Dimension Reduction (MBDR) transformation matrix proposed
by Brunzell is defined as:

$$U = \left[ C_{1,2}^{-1} m_{1,2}, \; \ldots, \; C_{i,j}^{-1} m_{i,j}, \; \ldots, \; C_{c-1,c}^{-1} m_{c-1,c} \right], \quad \text{for } 1 \le i < j \le c, \tag{3.3}$$

where $C_{i,j}$ are pairwise covariance matrices defined as $C_{i,j} = \Sigma_i + \Sigma_j$, $m_{i,j}$ are pairwise
class-specific mean vector differences defined as $m_{i,j} = m_i - m_j$, and c represents the total number of
classes. The feature dimension reduction scheme is applied by computing the SVD of the matrix
U and selecting as transformation matrix that which contains the first k singular vectors
associated with the k largest singular values of U.
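
A minimal sketch of this construction is given below for two-dimensional features and k = 1. It assumes that each column of U has the form $C_{i,j}^{-1} m_{i,j}$ and obtains the first left singular vector of U as the dominant eigenvector of $UU^T$ by power iteration. The helper names, the toy classes, and the restriction to 2-D data are assumptions of the sketch, not part of Brunzell's or the authors' implementation.

```python
import math
from itertools import combinations

def mean_and_cov(data):
    """Class mean and 2x2 covariance matrix of a list of 2-D points."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / n
    syy = sum((p[1] - my) ** 2 for p in data) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mbdr_matrix(classes):
    """Columns C_ij^{-1} m_ij for all class pairs i < j (2-D features only)."""
    stats = [mean_and_cov(c) for c in classes]
    columns = []
    for (mi, ci), (mj, cj) in combinations(stats, 2):
        a = ci[0][0] + cj[0][0]
        b = ci[0][1] + cj[0][1]
        d = ci[1][1] + cj[1][1]
        det = a * d - b * b  # pairwise covariance must be invertible
        dx, dy = mi[0] - mj[0], mi[1] - mj[1]
        columns.append(((d * dx - b * dy) / det, (-b * dx + a * dy) / det))
    return columns

def first_left_singular_vector(columns, iters=500):
    """Dominant eigenvector of U U^T, i.e. the first left singular vector of U."""
    g = [[sum(c[i] * c[j] for c in columns) for j in range(2)] for i in range(2)]
    v = [1.0, 0.5]
    for _ in range(iters):
        w = [g[i][0] * v[0] + g[i][1] * v[1] for i in range(2)]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    return v

# Three hypothetical classes stacked along the second axis.
classes = [
    [(0.0, 0.0), (1.0, 0.1), (2.0, -0.1), (3.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.1), (2.0, 0.9), (3.0, 1.0)],
    [(0.0, 2.0), (1.0, 2.1), (2.0, 1.9), (3.0, 2.0)],
]
U = mbdr_matrix(classes)
v = first_left_singular_vector(U)
proj = [[v[0] * x + v[1] * y for x, y in c] for c in classes]
```

With c = 3 classes, U has three columns (one per class pair); in this toy case they are all parallel, so a single singular vector already separates the three projected classes.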

Applying the MBDR matrix to the data considered in Figures II-1 & II-2 leads to Figures
II-3 & II-4. Results show the projection direction to be much better suited to preserving class
discrimination than that of PCA, as expected from the similarities to the LDA approach.
IV. KERNEL-BASED GENERALIZED DISCRIMINANT ANALYSIS


A.  INTRODUCTION 

LDA is a classical scheme well matched to classification applications as it preserves class
discrimination. However, it may fail when the problem under consideration contains
non-separable class information. A significant amount of research has been conducted recently in the
area of kernel-based approaches to address non-separable class problems. The main idea behind
kernel-based methods is to nonlinearly transform the input feature space into a higher-dimensional
space in which the transformed features are separable. Nonlinear transformations
are not new in themselves; however, most of the earlier ones involve computations in the
transformed space for the resulting classification set-up. The main advantage of the kernel-based
generalized discriminant analysis approach is that all computations may be carried
out in the original space by expressing the nonlinear transformation in terms of dot products
only. Such a reformulation of the problem leads to the computation of a class-separating
hyperplane with maximum margin without explicitly carrying out the transformation of the features.
It also leads to a nonlinear decision boundary in the original feature space. Such nonlinear
transformations have been known for some time but were not taken advantage of until Vapnik
presented the support vector machine (SVM) approach [VAP95, CHS00]. Since then, several
nonlinear generalizations of algorithms have been proposed: kernel-PCA [SSM99, SSM98,
TRC01, MSS99], kernel-based denoising [MSS99], kernel-based LDA [MRW99a, MRW99b,
BAN00], etc. Applications can be found in image processing [EPP00, CHV99], pattern
recognition [GSO00, MAE99, HAE99], text categorization [TKO99], speech processing
[NBR00], time series prediction [MSR97], radar imagery [LCB00], etc., and results have
shown in some cases a significant improvement in classifier performance over more established
methods. Our study is restricted to the nonlinear generalization of LDA called the generalized
discriminant analysis (GDA).


B.  DESCRIPTION 

The  GDA  is  an  extension  of  the  LDA  where  the  LDA  criterion  is  defined  in  the 
transformed  space.  However,  computations  are  carried  out  in  the  original  feature  space  by 
reformulating  the  GDA  criterion  in  terms  of  dot  products  of  the  nonlinear  transformation 
operation.  Recall  that  the  basic  LDA  projection  index  is  defined  as  the  direction  that  maximizes 
the  following  ratio: 


$$J(w) = \frac{w^T S_B w}{w^T S_W w}, \tag{4.1}$$

where $S_B$ and $S_W$ respectively represent the between-class and within-class scatter matrices.

Assume that we have N classes with $n_l$ samples per class $C_l$, i.e., $\sum_{l=1}^{N} n_l = M$, where M is the
total data sample size. Further, assume that x is nonlinearly transformed into a different space
with a mapping $\phi$:

$$\phi: X \to F, \qquad x \mapsto \phi(x).$$


The covariance matrix of the transformed data $\phi(x)$ is given by:

$$V = \frac{1}{M} \sum_{j=1}^{M} \phi(x_j)\,\phi^T(x_j), \tag{4.2}$$

assuming the transformed data $\phi(x)$ is zero-mean. The data can be centered by following the
procedure presented by Baudat & Anouar if it is not centered originally [BAN00, Appendix C].
Using class indices, (4.2) may be rewritten as:

$$V = \frac{1}{M} \sum_{l=1}^{N} \sum_{k=1}^{n_l} \phi(x_{lk})\,\phi^T(x_{lk}), \tag{4.3}$$

where $\phi(x_{lk})$ is the element k of class l.
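The between-class scatter matrix B in the transformed space F is defined as:

$$B = \frac{1}{M} \sum_{l=1}^{N} n_l\, \bar{\phi}_l\, \bar{\phi}_l^T, \tag{4.4}$$

where $\bar{\phi}_l$ is the mean of the transformed elements of class l. The GDA criterion is the ratio:

$$J(v) = \frac{v^T B v}{v^T V v}, \tag{4.5}$$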


where B and V are defined in Eqs. (4.3) & (4.4). The criterion is maximized when v is selected as
the eigenvector associated with the maximum generalized eigenvalue of the pair (B, V)
[DHS01]. Note that the eigenvectors v may be written in terms of the elements in the
transformed space F. Thus,

$$v = \sum_{p=1}^{N} \sum_{q=1}^{n_p} \alpha_{pq}\, \phi(x_{pq}). \tag{4.6}$$

Replacing v by its expansion given in Eq. (4.6) into Eq. (4.5), Baudat and Anouar show that the
projection index J(.) may be rewritten as [BAN00]:

$$J(\alpha) = \frac{\alpha^T K W K \alpha}{\alpha^T K K \alpha}. \tag{4.7}$$


The matrix K is of dimension M×M and is defined on the class elements by the blocks $K_{pq}$, each of
dimension $n_p \times n_q$. Each block matrix $K_{pq}$ is composed of dot products in the transformed feature
space F. Thus:

$$K = (K_{pq})_{p=1,\ldots,N;\; q=1,\ldots,N}, \quad \text{with } K_{pq} = (k_{ij})_{i=1,\ldots,n_p;\; j=1,\ldots,n_q}, \tag{4.8}$$

where for given classes p and q the elements $k_{ij}$ are defined in terms of dot products of the
nonlinear transformation, i.e.,

$$(k_{ij})_{pq} = \phi^T(x_{pi})\,\phi(x_{qj}).$$

The matrix W is a block diagonal matrix of dimension M×M where each block $W_l$, l = 1, ..., N, is of
dimension $n_l \times n_l$ and defined as:

$$W_l = \frac{1}{n_l} \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{bmatrix}, \quad l = 1, \ldots, N. \tag{4.9}$$
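
The form of the criterion in Eq. (4.7) can be checked with a short calculation. The sketch below assumes that the between-class matrix has the usual form $B = \frac{1}{M}\sum_{l} n_l \bar{\phi}_l \bar{\phi}_l^T$, with $\bar{\phi}_l$ the mean of class l in F, and introduces two pieces of notation that are not in the original text: $\Psi$, the matrix whose M columns are the transformed samples (so that $v = \Psi\alpha$ and $K = \Psi^T\Psi$), and $\mathbf{1}_l$, the M-vector with ones at the positions of the class-l samples and zeros elsewhere (so that $\bar{\phi}_l = \frac{1}{n_l}\Psi\mathbf{1}_l$ and $W = \sum_l \frac{1}{n_l}\mathbf{1}_l\mathbf{1}_l^T$).

$$
\begin{aligned}
v^T V v &= \tfrac{1}{M}\,\alpha^T \Psi^T \Psi\, \Psi^T \Psi\, \alpha = \tfrac{1}{M}\,\alpha^T K K \alpha, \\
v^T \bar{\phi}_l &= \tfrac{1}{n_l}\,\alpha^T K \mathbf{1}_l, \\
v^T B v &= \tfrac{1}{M} \sum_{l=1}^{N} \tfrac{1}{n_l}\,\alpha^T K \mathbf{1}_l \mathbf{1}_l^T K \alpha = \tfrac{1}{M}\,\alpha^T K W K \alpha .
\end{aligned}
$$

The factor 1/M cancels in the ratio, which leaves the quotient of $\alpha^T K W K \alpha$ and $\alpha^T K K \alpha$; only dot products between transformed samples, i.e. entries of K, are needed.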

Baudat and Anouar show that the above generalized eigenvector problem may be simplified and
reformulated as [BAN00]:

$$\lambda \beta = P^T W P \beta. \tag{4.10}$$

Therefore, the GDA problem becomes that of finding the eigenvector $\beta$, defined in terms of the
eigenvector $\alpha$ as $\beta = T P^T \alpha$, where P and T are the eigenvector and eigenvalue matrices of K
respectively. The eigenvector $\alpha$ may be computed back from $\beta$ by the transformation
$\alpha = P T^{-1} \beta$. One of the potential drawbacks in the GDA is the computational load involved in
computing the matrix inverse $T^{-1}$, as T is of dimension M×M, where M is the dataset size.
However, computationally efficient alternatives have been reported in [MMR01, LRO01,
KMW]. Our implementation computes the inverse $T^{-1}$ with a reduced-rank pseudo-inverse to
avoid ill-conditioning problems.
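
One way to picture the reduced-rank pseudo-inverse is sketched below. It assumes that T is the diagonal matrix of eigenvalues of K, so that only its diagonal needs to be handled; the function name and the rule of keeping the `rank` largest-magnitude eigenvalues are assumptions of this sketch rather than a description of the MATLAB code used in the study.

```python
def reduced_rank_pinv(eigenvalues, rank):
    """Diagonal of the rank-limited pseudo-inverse of a diagonal eigenvalue matrix.

    The `rank` largest-magnitude eigenvalues are inverted; all others are
    replaced by zero so that tiny eigenvalues do not blow up the inverse.
    """
    order = sorted(range(len(eigenvalues)),
                   key=lambda i: abs(eigenvalues[i]), reverse=True)
    inv = [0.0] * len(eigenvalues)
    for i in order[:rank]:
        if eigenvalues[i] != 0.0:
            inv[i] = 1.0 / eigenvalues[i]
    return inv

# Hypothetical eigenvalue spectrum with one nearly singular direction.
inv_diag = reduced_rank_pinv([4.0, 2.0, 1e-12, 0.5], rank=2)
```

Here the eigenvalue 1e-12 would contribute a factor of 1e12 to a plain inverse; truncating at rank 2 sets that entry (and the next smallest) to zero instead.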

Transformations with Gaussian and polynomial kernels have been used extensively in
kernel-based implementations [CHS00, HEA99, MMR01, BUR98]. We selected the Gaussian
kernel, $k(x, y) = \exp(-\|x - y\|^2 / c)$, with variable spread c in this study and implemented the
GDA using MATLAB. One important issue is the specific selection of the spread, which affects the
classification performance; Muller et al. address the model selection issue in their tutorial
[MMR01]. However, an automated selection of the spread c was beyond the scope of this study,
and c was determined by trial and error in our simulations.
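
The quantities entering Eq. (4.7) can be assembled as in the following sketch, which builds the Gaussian kernel matrix K, the block-diagonal matrix W, and evaluates J for a given coefficient vector. Samples are assumed to be ordered class by class; all function names, the toy samples, and the trial coefficient vector are hypothetical, and the kernel centering step of [BAN00, Appendix C] is omitted for brevity.

```python
import math

def gaussian_kernel(x, y, c):
    """k(x, y) = exp(-||x - y||^2 / c) with spread c."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)

def kernel_matrix(samples, c):
    """M x M matrix of kernel evaluations between all pairs of samples."""
    return [[gaussian_kernel(x, y, c) for y in samples] for x in samples]

def block_weight_matrix(class_sizes):
    """Block-diagonal W whose l-th block is an n_l x n_l matrix of 1/n_l."""
    m = sum(class_sizes)
    w = [[0.0] * m for _ in range(m)]
    start = 0
    for n in class_sizes:
        for i in range(start, start + n):
            for j in range(start, start + n):
                w[i][j] = 1.0 / n
        start += n
    return w

def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def gda_criterion(k, w, alpha):
    """J(alpha) = (alpha^T K W K alpha) / (alpha^T K K alpha), Eq. (4.7)."""
    k_alpha = matvec(k, alpha)  # K is symmetric, so alpha^T K = (K alpha)^T
    num = sum(a * b for a, b in zip(k_alpha, matvec(w, k_alpha)))
    den = sum(a * a for a in k_alpha)
    return num / den

# Two hypothetical classes of two samples each, listed class by class.
samples = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (3.1, 3.0)]
K = kernel_matrix(samples, c=1.0)
W = block_weight_matrix([2, 2])
J = gda_criterion(K, W, [1.0, 1.0, -1.0, -1.0])
```

Because W averages within each class, J lies between 0 and 1 and approaches 1 when the projections $K\alpha$ are nearly constant within each class, as they are for this well-separated toy example.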
V. MEAN SEPARATOR NEURAL NETWORK

The Mean Separator Neural Network (MSNN) proposed by Duzenli & Fargues belongs
to the class of projection pursuit algorithms [DUZ98, DFA98, DUF98]. The basic MSNN
implementation is defined to differentiate between two classes $\{x_i\}$ and $\{y_i\}$. It iteratively looks
for a one-dimensional nonlinear projection of the feature space that maximizes the
difference between the projected class data means, for a user-specified nonlinear activation function
$\Phi(\cdot)$. As a result, the mean difference criterion MD(.) to be maximized is defined as:

$$MD(w) = \left| E[\Phi(w^T x)] - E[\Phi(w^T y)] \right|, \tag{5.1}$$

where w is a column weight vector. The scheme can be viewed as a one-layer back-propagation
neural network (BPNN) implementation with one processing element. The MSNN was
implemented using the nonlinear logsig function as activation function and gradient descent
with variable learning rate [DUZ98]. The scheme was extended to classify more than two
classes by reformulating the overall problem as a set of pairwise sub-problems [DUZ98].
Results showed the MSNN to lead to similar or better classification performances than more
computationally expensive BPNNs, and significantly higher classification performances than
obtained with classification trees on the data investigated. Further details regarding these
comparisons may be found in Duzenli [DUZ98]. Note that the MSNN implementation suffers from
the same drawback as other BPNN implementations: the iterative procedure is not
guaranteed to converge to the global minimum as a result of the nonlinear activation function $\Phi(\cdot)$
used in the projection criterion definition. Therefore, we ran the MSNN a few times with
different initial conditions and selected the weight vector w leading to the best training
performances in our simulations.
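
The mean difference criterion and the restart strategy described above can be sketched as follows. The sketch assumes the logsig activation and the absolute mean difference of Eq. (5.1); the finite-difference gradient, the fixed step size, and all function and parameter names are simplifications introduced here and stand in for the back-propagation training with variable learning rate of [DUZ98].

```python
import math
import random

def logsig(t):
    """Numerically stable logistic sigmoid."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def mean_difference(w, xs, ys):
    """MD(w) = |E[logsig(w^T x)] - E[logsig(w^T y)]| estimated from samples."""
    def mean_activation(data):
        return sum(logsig(sum(wi * vi for wi, vi in zip(w, v))) for v in data) / len(data)
    return abs(mean_activation(xs) - mean_activation(ys))

def train_msnn(xs, ys, restarts=5, steps=200, lr=0.5, eps=1e-6, seed=0):
    """Gradient ascent on MD(w) from several random starts; keep the best w."""
    rng = random.Random(seed)
    best_w, best_md = None, -1.0
    for _ in range(restarts):
        w = [rng.uniform(-1.0, 1.0) for _ in xs[0]]
        for _ in range(steps):
            base = mean_difference(w, xs, ys)
            grad = []
            for i in range(len(w)):
                shifted = list(w)
                shifted[i] += eps
                grad.append((mean_difference(shifted, xs, ys) - base) / eps)
            w = [wi + lr * gi for wi, gi in zip(w, grad)]
        md = mean_difference(w, xs, ys)
        if md > best_md:
            best_w, best_md = w, md
    return best_w, best_md

# Two hypothetical classes separated along the first feature.
xs = [(2.0, 0.0), (3.0, 1.0), (2.5, 0.5)]
ys = [(-2.0, 0.0), (-3.0, -1.0), (-2.5, 0.5)]
w, md = train_msnn(xs, ys)
```

Since the logsig output lies in (0, 1), MD(w) is bounded by 1; keeping the best of several restarts mirrors the way the local-minimum issue is handled in the simulations.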




Extensions  to  the  basic  MSNN  algorithm  were  considered  by  San  Pedro  who 
investigated  the  following  projection  criterion  that  takes  into  account  both  mean  differences  and 
variance  of  the  projected  data: 


$$MD2(w) = \frac{\left( E[\Phi(w^T x)] - E[\Phi(w^T y)] \right)^2}{\mathrm{var}[\Phi(w^T x)] + \mathrm{var}[\Phi(w^T y)]}. \tag{5.2}$$


In this case, the goal becomes to maximize a ratio of the projected class mean difference over the
projected class variances. The criterion MD2(.) may be viewed as a nonlinear implementation of the
pairwise Fisher linear discriminant. However, this problem can no longer be solved using
eigen-properties, due to the nonlinear activation function $\Phi(\cdot)$, and an iterative procedure is
required to maximize the projection criterion MD2(.). Various stopping criteria and slightly
modified versions of the projection criterion and data set-up were also investigated in San Pedro
[SAN00]. However, results showed no significant classification performance improvement of
the extensions with respect to the basic MSNN implementation on the data investigated for the
nonlinear transformations considered. Therefore, we considered only the basic MSNN
implementation with pairwise coupling in this benchmarking study.
VI. CLASSIFIER PERFORMANCES COMPARISON


A.  DATA  DESCRIPTION 

The MSNN, PCA, MBDR, and kernel-based GDA approaches were implemented in
MATLAB and applied to the following three classification problems, commonly used in
statistics for benchmarking purposes, to evaluate the performances of the feature dimension
reduction algorithms. All datasets were obtained from [MLD], and further details describing the
feature characteristics and statistics of each dataset can be found there.

1. Iris data: One of the typical benchmarking datasets selected to investigate the
performance of a classifier when dealing with nonlinearly separable data is the IRIS
dataset [MLD]. This dataset has three classes with four-dimensional features, where
two of the classes are not linearly separable, while the third class is linearly separable
from the other two. Twenty-five trials per class were selected for training, and
twenty-five for testing.

2. Handwritten Digits data: This dataset contains attributes representing normalized
bitmaps of handwritten digits from a preprinted form. The dataset has 10 classes and
64 features normalized in the range [0,16]. 87 trials per class were selected for
training, and 87 for testing.

3. Spam E-mail data: This dataset contains attributes indicating whether a specific
e-mail can be considered as spam or non-spam e-mail. The dataset has two classes
(spam and non-spam type) and 57 features per trial. Most of the features indicate
whether a particular word or character occurred frequently in the e-mail, and
further details regarding each individual feature can be found in [MLD]. 227 trials per
class were selected for training, and 227 for testing.
B. CLASSIFIER SET-UP


Once the feature dimensions are reduced to a desired user-selected size, classification of
the data is obtained by applying the basic minimum distance classifier described next. First, the
training dataset is used to obtain the mean of the transformed feature vectors for each class.
These vectors are selected to represent each class and are called class-specific
mean feature vectors. During testing, each unlabelled feature vector is compared against
all class-specific mean feature vectors, and the class decision is made by selecting the class
whose mean feature vector is closest to the unlabelled feature vector.
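
The classifier described above can be sketched in a few lines. The Euclidean distance, the dictionary-based data layout, and the function names are assumptions of this sketch; the report does not state which distance measure was used.

```python
import math

def class_means(train):
    """Map each class label to the mean of its reduced-dimension feature vectors."""
    means = {}
    for label, vectors in train.items():
        dim = len(vectors[0])
        means[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return means

def classify(x, means):
    """Assign x to the class whose mean feature vector is closest (Euclidean)."""
    def distance(label):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, means[label])))
    return min(means, key=distance)

# Hypothetical two-dimensional transformed training features for two classes.
train = {"C1": [(0.0, 0.0), (1.0, 0.0)], "C2": [(5.0, 5.0), (6.0, 5.0)]}
means = class_means(train)
label_a = classify((0.4, 0.2), means)
label_b = classify((5.0, 4.0), means)
```

Only one mean vector per class is stored, so the quality of the decision depends almost entirely on how well the preceding dimension reduction step separates the classes.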

We varied the size of the projection, i.e., the size of the reduced dimension features, for the
PCA and MBDR schemes to evaluate the sensitivity of the feature reduction algorithm to the
dimensionality of the projection. Such a variation is not possible for the MSNN algorithm, as it
implements a fixed one-dimensional projection. However, we ran the MSNN algorithm several
times for each training dataset, starting the iteration with different random initial values each time
in an effort to mitigate the local minima issue discussed earlier, and selected the weights leading
to the best training dataset classification performances.

C.  RESULTS 

Figures VI-1 to VI-3 present the overall classification results obtained with the various
dimension reduction schemes followed by the minimum distance classifier. Overall classification
performances for both training and testing sets are shown to evaluate any potential
generalization issues. Corresponding confusion matrices are included in the Appendix.
1. IRIS Data


Figure VI-1 presents the overall classification performance obtained for the IRIS
dataset. Recall that this dataset has a relatively low feature dimension to start with,
and that it was selected because two of the classes (C2 and C3) are not linearly separable, while
the third class (C1) is linearly separable from the other two. Results show the GDA approach is
successful in separating the two nonlinearly separable classes while the MSNN is not. Two
different implementations of the GDA with slightly different overall classification performances
are shown: Kernel-1 and Kernel-2.

Kernel-1: spread value c equal to 1.5 and reduced rank for the pseudo-inverse of
T equal to 75 (full matrix size),

Kernel-2: spread value c equal to 1 and reduced rank for the pseudo-inverse of T
equal to 20 (by visual inspection of the eigenvalue spread for T).

Simulations showed that large variations in the spread value may result in significant
classification performance differences (when the spread is selected too large or too small for the
data under investigation). Results also showed that the specific selection of the reduced rank value
might have some impact on the classification performances. However, no extensive study was
conducted, and further study is required to validate these findings. The results presented here
show some slight differences due to small variations in the spread and in the reduced rank used
for the pseudo-inverse of T.

The PCA, LDA and MBDR approaches based on a two-dimensional projection of the
features show a few classification errors for data in classes C2 and C3, resulting in an overall
classification error of 8%.

2.  Handwritten  Digits  Data 

Figure VI-2 presents the overall classification performance obtained for the
handwritten digit dataset. Recall that this dataset has a relatively high feature
dimension to start with (60 features). Results show the best performance is obtained for the
kernel-based implementation, followed by PCA (when the projection dimension is larger than
10), the MSNN, and finally the MBDR approach. A few comments are in order.

• MSNN performances vary from run to run due to the local minimum issues
inherent in this algorithm, and two different runs are shown here: MSNN-t1 and
MSNN-t2, where the difference lies in the random initial values selected during
the training phase. This result further highlights the fact that the MSNN
should be run a few times on a given training set, with the version leading to the
best performance selected, in an effort to minimize this drawback.

•  The  PCA  feature  dimension  reduction  process  clearly  degrades  the  discrimination 
quality  of  the  class  features,  as  the  classification  performances  degrade  with 
decreasing  feature  size  (projected  features  of  dimension  2,  10,  20,  30  are  shown 
here).  Simulations  showed  classification  performances  to  be  identical  for 
dimensions 30 to 40. This result also highlights a well-known problem of PCA
when applied to classification applications: the dimension reduction
criterion is not necessarily designed to preserve class discrimination information.

• The MBDR scheme also degrades the discrimination quality of the class features,
as classification performances decrease with decreasing feature sizes (two-,
four-, and ten-dimensional projections are reported here). Simulations showed
performances to be identical for projection sizes between 4 and 8.

•  Simulations  showed  the  kernel-based  implementation  (using  a  four-dimensional 
projection)  clearly  leads  to  the  best  classification  performances  of  all  the  schemes 
considered  for  this  dataset. 
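
The PCA limitation noted above can be seen on a small constructed example. The sketch below uses hypothetical two-dimensional data (not one of the benchmark datasets) in which both classes share a large variance along x and differ only along y. The helper names are illustrative, and the closed-form axis orientation applies to the 2-D case only.

```python
import math

# Hypothetical 2-D data: large shared variance along x, class difference along y.
class_a = [(-10.0, 1.2), (-5.0, 0.8), (0.0, 1.0), (5.0, 0.8), (10.0, 1.2)]
class_b = [(-10.0, -0.8), (-5.0, -1.2), (0.0, -1.0), (5.0, -1.2), (10.0, -0.8)]

def mean(values):
    return sum(values) / len(values)

def principal_axes(points):
    """Return (first, second) unit principal axes of 2-D points; labels are ignored."""
    mx = mean([p[0] for p in points])
    my = mean([p[1] for p in points])
    sxx = mean([(p[0] - mx) ** 2 for p in points])
    syy = mean([(p[1] - my) ** 2 for p in points])
    sxy = mean([(p[0] - mx) * (p[1] - my) for p in points])
    # Closed-form orientation of the largest-variance axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    first = (math.cos(theta), math.sin(theta))
    second = (-math.sin(theta), math.cos(theta))
    return first, second

def class_mean_gap(axis):
    """Distance between the two class means after projection onto a unit axis."""
    proj_a = mean([p[0] * axis[0] + p[1] * axis[1] for p in class_a])
    proj_b = mean([p[0] * axis[0] + p[1] * axis[1] for p in class_b])
    return abs(proj_a - proj_b)

first, second = principal_axes(class_a + class_b)
print("gap along first principal axis: ", class_mean_gap(first))
print("gap along second principal axis:", class_mean_gap(second))
```

In this example the first principal axis captures nearly all of the variance yet leaves the two class means on top of each other, while the discarded second axis is the one that separates them. A one-dimensional PCA projection would therefore remove the discriminative information entirely.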

3. SPAM E-mail Data


Figure VI-3 presents the overall classification performance obtained for the SPAM
e-mail dataset. Recall that this dataset has a relatively high feature dimension to start
with (57 features) and only two classes. Results show the best overall classification performance
is obtained for the kernel-based implementation, followed by the MBDR scheme, the MSNN
implementation, and finally by the various implementations of the PCA (where 2-, 10-, 20-, and
30-dimensional projections were investigated). A few comments are in order.

• A one-dimensional projection was selected for the MBDR approach, as
only one eigenvalue of the matrix U defined earlier in Eq. (3.3) was nonzero.

•  Simulations  showed  the  PCA  feature  reduction  scheme  has  the  worst 
classification  performances  of  all  schemes  considered,  and  that  no  improvements 
are  observed  by  increasing  the  transformed  feature  space  dimension  from  10  to 
40. 

•  The  best  overall  classification  performance  was  obtained  with  the  kernel-based 
classifier  followed  by  the  MSNN  implementation  and  the  MBDR  scheme. 
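
One plausible reading of the single nonzero eigenvalue reported for the two-class case is sketched below. It assumes, as a simplification that is not confirmed by this section, that the matrix in question is dominated by an outer product of the difference between the two class means. Such a matrix has rank one regardless of the feature dimension. The vector values and function names are hypothetical.

```python
def outer(v):
    """Rank-one matrix v v^T as a list of rows."""
    return [[a * b for b in v] for a in v]

def matvec(m, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

# Hypothetical difference between two class means in a 3-D feature space.
mean_diff = [1.0, 2.0, 2.0]
scatter = outer(mean_diff)

# mean_diff is an eigenvector with eigenvalue ||mean_diff||^2 = 9.
print(matvec(scatter, mean_diff))
# Vectors orthogonal to mean_diff are mapped to zero (eigenvalue 0).
print(matvec(scatter, [2.0, -1.0, 0.0]))
print(matvec(scatter, [2.0, 0.0, -1.0]))
```

Under that assumption, only the direction of the mean difference carries a nonzero eigenvalue, so a one-dimensional projection retains everything such a matrix can express for a two-class problem.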

Figure VI-1. IRIS dataset; Overall Classification Error Performance
Figure VI-2. Handwritten Digits dataset; Overall Classification Error Performance
Figure VI-3. SPAM E-mail dataset; Overall Classification Error Performance

VII. CONCLUSIONS


Classification  applications  require  the  extraction  of  class  discriminative  information. 
However, this step often leads to high-dimensional feature spaces, which in turn require large datasets
to create viable classification schemes. This study presents follow-on work to [DUZ98, SAN00]
and  considers  two  discriminant-based  feature  dimension  reduction  schemes  for  classification 
applications.  The  two  feature  reduction  schemes  considered  are  the  Mahalanobis-based 
dimension  reduction  (MBDR)  scheme  recently  proposed  by  Brunzell,  and  the  kernel-based 
generalized  discriminant  analysis  approach  (GDA)  proposed  by  Baudat  &  Anouar.  The  GDA  is 
part  of  a  new  breed  of  kernel-based  algorithms  that  are  currently  being  considered  by  the 
research  community  to  develop  new  learning  techniques,  as  they  can  be  used  to  derive  nonlinear 
generalizations  of  currently  known  algorithms.  Finally,  the  classical  PCA  and  the  MSNN 
proposed  earlier  in  [DUZ98]  are  included  in  this  study  for  comparison  purposes. 

The  four  feature  dimension  reduction  schemes  considered  were  implemented  in 
MATLAB  and  evaluated  by  applying  the  transformed  features  to  a  basic  minimum  distance 
classifier.  Performances  are  evaluated  by  applying  these  schemes  to  three  datasets  commonly 
used in statistics for benchmarking purposes. The GDA yields the best overall results for
the datasets considered. Results also show there is no consistent second-best feature
reduction  scheme  among  the  MSNN,  the  MBDR,  and  the  PCA,  as  performances  for  these  three 
schemes  are  data  dependent. 
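
For reference, a minimum distance classifier of the kind mentioned above can be sketched as follows. The sketch assumes Euclidean distance to the class means, which this section does not state explicitly. The feature values, labels, and function names are hypothetical and do not reproduce the MATLAB implementation.

```python
import math

def class_means(features, labels):
    """Mean feature vector of each class, computed from the training data."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def classify(x, means):
    """Assign x to the class whose mean is closest (Euclidean distance assumed)."""
    return min(means, key=lambda y: math.dist(x, means[y]))

# Hypothetical reduced (2-D) training features for two classes.
train_x = [(0.0, 0.1), (0.2, -0.1), (2.0, 2.1), (2.2, 1.9)]
train_y = ["C1", "C1", "C2", "C2"]
means = class_means(train_x, train_y)

print(classify((0.3, 0.2), means))
print(classify((1.8, 2.4), means))
```

Because the classifier itself is this simple, differences in classification performance can be attributed mainly to the feature reduction scheme that produced the transformed features.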

Note that our investigation of the generalized discriminant analysis (GDA) approach remains
preliminary in nature, as our study was restricted to the Gaussian kernel case only, and issues
regarding the specific selection of a kernel type were not addressed. In addition, we did not
consider issues regarding the specific selection of the spread factor for the Gaussian kernel.
Further investigations addressing these two issues would be needed to complete the study of the
GDA behavior. Nevertheless, results are very promising, as the GDA yields the best overall
results for the datasets considered. However, the GDA is also potentially the most
computationally intensive of the four schemes considered, depending on the size of the data
considered.

Finally,  investigating  the  applicability  of  the  GDA  approach  to  the  classification  of 
digital  modulation  types,  and  comparing  the  resulting  performances  to  those  obtained  using  the 
higher-order  statistics  based  hierarchical  approach  discussed  in  [HAT01,  FAH01]  is  left  for 
further  study. 

APPENDIX


This Appendix contains confusion matrices obtained for training and testing sets for the
three datasets [MDL] selected to benchmark the feature reduction schemes considered in this
study:

1)  Digit  Data:  5  classes  identified  by  60  features  per  trial,  87  trials  in  training  and  testing 
datasets. 

2)  IRIS  Data:  3  classes  identified  by  4  features  per  trial,  25  trials  in  training  and  testing 
datasets. 

3)  Spam  e-mail  Data:  2  classes  identified  by  57  features  per  trial,  227  trials  in  training 
and  testing  datasets. 
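
As a reading aid for the confusion matrices, the sketch below shows how an overall classification error can be derived from one. The matrix is hypothetical and is not taken from this report. It assumes 25 trials per class for a three-class problem, with rows as true classes and columns as decisions.

```python
def overall_error(confusion):
    """Fraction of misclassified trials; rows are true classes, columns are decisions."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total

# Hypothetical 3-class confusion matrix, 25 trials per class (illustrative only).
confusion = [
    [25, 0, 0],   # first class: all trials classified correctly
    [0, 22, 3],   # second class: 3 trials assigned to the third class
    [0, 3, 22],   # third class: 3 trials assigned to the second class
]
print(f"overall error: {overall_error(confusion):.1%}")
```

Here 6 of the 75 trials fall off the diagonal, which corresponds to an overall error of 8%.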